1 File types

1.1 R Script

  • Text file containing a set of commands and comments (#)
  • Script can be saved and used later to re-execute the saved commands
  • Script can be edited so you can execute a modified version of the commands.
  • Saves as .R file

1.2 R Shiny Web App


1.3 R Markdown

  • Lets you mix text with code for R, Python, etc
  • Text is in R Markdown
  • Saves as .Rmd file
  • Outputs to HTML, docx, LaTeX (PDF)
  • Uses ``` for R chunks

1.4 R Sweave

  • Lets you mix LaTeX text with R code
  • Has largely been superseded by knitr
  • Saves as .Rnw file
  • Outputs to LaTeX (PDF)
  • Uses <<>>= and @ for R chunks

1.5 Knitr

  • Processes both Markdown and Sweave files

2 Online resources


3 RStudio layout


4 Arithmetic

The pound sign (#) is used for comments in R. Below are some of the most common arithmetic operators in R.

# Addition
5+10
## [1] 15
# Subtraction
100-6
## [1] 94
# Multiplication
4*9
## [1] 36
# Division
(5 + 5) / 2
## [1] 5
# Exponentiation
2^4
## [1] 16
# Modulo
18 %% 5
## [1] 3
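Two related points go slightly beyond the list above and are sketched here as an illustration: R also has an integer-division operator (%/%) that pairs with modulo, and R follows the usual order of operations, which is why the division example uses parentheses. The variable names below are only for illustration.

```r
# Integer division and modulo split a number into quotient and remainder
quotient  <- 18 %/% 5    # 3
remainder <- 18 %% 5     # 3
quotient * 5 + remainder # gives back 18

# Order of operations: division is evaluated before addition
without_parens <- 5 + 5 / 2    # 7.5
with_parens    <- (5 + 5) / 2  # 5
```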

5 Variable assignment

A basic tool in statistical programming is the variable. A variable allows users to store a value (e.g. 7) or an object (e.g. a function description). You can then use the name of the variable later on to easily access the value or the object that is stored within it. When creating a variable in R, use the <- or = operator with the variable name on the left and the variable value on the right.

# Assign the value 19 to x
x <- 19

# Print out the value of the variable x
x
## [1] 19

We can perform arithmetic on variables.

# Assign the value 7 to y
y <- 7

# Add the values of x and y and store the result in z
z <- x + y

# Print the value of z
z
## [1] 26

You can store a function as a variable. Here we create a function that adds two values. The name of the function is sumTwoValues. The function requires the user to input two values (two “input parameters”). It then adds the two values and stores the result in a variable called sum. The value of sum is then returned to the user.

sumTwoValues <- function(x, y) {
  sum <- x + y
  return(sum)
}

Now, if we type the function name on its own (without parentheses), R will simply list the code of the function.

sumTwoValues
## function(x, y) {
##   sum <- x + y
##   return(sum)
## }

To actually use the function, we must call the function name (sumTwoValues) and input the two required variables (x and y).

x <- 1 
y <- 2
sumTwoValues(x, y)
## [1] 3
sumTwoValues(100,80)
## [1] 180

Here is another example of creating a function in R, this time one that converts Fahrenheit to Celsius. This function requires one input parameter, temp_F, from the user.

fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

We can call the function as follows:

temp_F <- 30
fahrenheit_to_celsius(temp_F)
## [1] -1.111111

Or as follows:

myTempF <- 30
fahrenheit_to_celsius(myTempF)
## [1] -1.111111

Now, we can run several variations of the same function call using different input parameter values of interest:

fahrenheit_to_celsius(30)
## [1] -1.111111
fahrenheit_to_celsius(40)
## [1] 4.444444
fahrenheit_to_celsius(50)
## [1] 10
fahrenheit_to_celsius(60)
## [1] 15.55556
fahrenheit_to_celsius(70)
## [1] 21.11111

We won’t go into details of loops in R. These can be found in many online tutorials. But for demonstrative purposes, the code above could be further reduced as follows:

for (t in seq(from = 30, to = 70, by = 10)){
   print(fahrenheit_to_celsius(t))
}
## [1] -1.111111
## [1] 4.444444
## [1] 10
## [1] 15.55556
## [1] 21.11111
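As a side note to the loop above, most arithmetic in R is vectorized, so the same function can usually be applied to a whole vector in one call. The sketch below repeats the function definition so that it runs on its own; temps_F and temps_C are names chosen only for this illustration.

```r
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

# Pass all five temperatures at once instead of looping over them
temps_F <- seq(from = 30, to = 70, by = 10)
temps_C <- fahrenheit_to_celsius(temps_F)

# Rounded to two decimals: -1.11 4.44 10.00 15.56 21.11
round(temps_C, 2)
```

The result is a numeric vector with one converted value per input, in the same order as the loop printed them.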

How did this work? First, let’s look at the seq() function. This is a built-in R function. If we type just the name of the function, we will only see the code of the function, which is not helpful.

seq
## function (...) 
## UseMethod("seq")
## <bytecode: 0x7f9b826e1b60>
## <environment: namespace:base>

Instead, let’s run a help() command on the seq() function. We can do this by running either of the two commands below:

help(seq)
?(seq)

Both commands below perform the same task and create the values 30, 40, 50, 60, 70. Note that you do not have to explicitly name the input parameters for the command to work, as long as you supply them in the expected order.

seq(from = 30, to = 70, by = 10)
## [1] 30 40 50 60 70
seq(30, 70, 10)
## [1] 30 40 50 60 70
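A few other common ways to build sequences are sketched below. They are not used elsewhere in this tutorial, and the variable names are only for illustration.

```r
# Ask for a number of evenly spaced points instead of a step size
five_points <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# The colon operator is shorthand for steps of 1
by_ones <- 30:35                           # 30 31 32 33 34 35

# seq_len(n) counts from 1 to n and gives an empty vector when n is 0
one_to_five <- seq_len(5)                  # 1 2 3 4 5
nothing     <- seq_len(0)                  # integer(0)
```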

6 Basic data types in R

R uses various data types. Some of the most basic types to get started are characters, numerics, integers, and logicals.

# Set my_numeric to be 9.3
my_numeric <- 9.3

# Set my_character to be "mouse"
my_character <- "mouse"

# Set my_logical to be TRUE
my_logical <- TRUE

We can update the variable we stored to have new values.

# Change my_numeric to be 10
my_numeric <- 10

# Change my_character to be "mars"
my_character <- "mars"

# Change my_logical to be FALSE
my_logical <- FALSE

The str() function is very useful in R to help you understand what data type and values are associated with variables.

# Run str() on my_numeric
str(my_numeric)
##  num 10
# Run str() on my_character
str(my_character)
##  chr "mars"
# Run str() on my_logical
str(my_logical)
##  logi FALSE

Note that you can determine all variables you have stored in your current R session by typing:

ls()
##  [1] "fahrenheit_to_celsius" "my_character"          "my_logical"           
##  [4] "my_numeric"            "myTempF"               "sumTwoValues"         
##  [7] "t"                     "temp_F"                "x"                    
## [10] "y"                     "z"

You can remove all variables in your current R session using the rm(list=ls()) function:

rm(list=ls())
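The list of basic types above mentions integers, which the examples did not show. As a small sketch (the variable names are only for illustration), a whole number is stored as a numeric unless you add the L suffix:

```r
# A whole number typed without a suffix is stored as a numeric (double)
my_numeric <- 10
class(my_numeric)        # "numeric"

# Adding the L suffix stores it as an integer
my_integer <- 10L
class(my_integer)        # "integer"

# The is.*() functions test for a type and return a logical
is.integer(my_numeric)   # FALSE
is.integer(my_integer)   # TRUE
is.numeric(my_integer)   # TRUE, integers also count as numeric
```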

7 Basic data structures in R

R has several data structures. These include atomic vectors, lists, matrices, factors, and data frames.

7.1 Vectors

Vectors are collections of elements that are usually of mode character, logical, integer or numeric. We can create an empty vector using vector(). (By default the mode is logical.)

# an empty 'logical' (default) vector
vector()
## logical(0)
# a vector of mode 'character' with 3 elements
vector("character", length = 3)
## [1] "" "" ""
# the same result, but using the constructor directly
character(3)
## [1] "" "" ""

You can also generate vectors by directly specifying their contents. R will then infer the appropriate mode of storage for the vector.

# create a vector x of mode numeric
x <- c(1, 2, 3)
# create a vector y of mode logical using TRUE and FALSE
y <- c(TRUE, TRUE, FALSE, FALSE)
# create a vector z of mode character using quoted text
z <- c("Nailil", "Quang", "Kaness")

In addition to str(), you can also examine vectors using typeof(), length(), class().

str(z)
##  chr [1:3] "Nailil" "Quang" "Kaness"
typeof(z)
## [1] "character"
length(z)
## [1] 3
class(z)
## [1] "character"

You can add elements to a vector using the combine function, c().

z <- c(z, "Sunanda")
z <- c("Fujita", z)

R allows missing data in vectors. Missing data are represented as NA (Not Available). The function is.na() informs which elements of vectors are missing data, and the function anyNA() returns TRUE if the vector contains at least one missing value.

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")

is.na(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE
is.na(y)
## [1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
## [1] TRUE
anyNA(y)
## [1] FALSE
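Missing values also affect calculations. The sketch below uses a made-up numeric vector (scores) to show that summary functions return NA unless you ask them to skip missing values with na.rm = TRUE.

```r
scores <- c(4, NA, 7, 10, NA)      # hypothetical data with two missing values

n_missing  <- sum(is.na(scores))   # 2, because TRUE counts as 1
mean(scores)                       # NA, the missing values propagate
clean_mean <- mean(scores, na.rm = TRUE)   # 7

# Keep only the non-missing elements
observed <- scores[!is.na(scores)]         # 4 7 10
```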

You can mix different types within a vector. In that case, R will create a vector with a mode that best accommodates all the elements it contains. Conversion between modes of storage is known as “coercion”. For example, guess what storage mode each of the following mixed-type vectors ends up with.

mixX <- c(3.3, "p")
mixY <- c(TRUE, 5)
mixZ <- c("q", TRUE)

See if your guess is correct!

str(mixX)
##  chr [1:2] "3.3" "p"
str(mixY)
##  num [1:2] 1 5
str(mixZ)
##  chr [1:2] "q" "TRUE"
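Coercion can also be requested explicitly with the as.*() family of functions. The following is a small sketch with made-up values; the variable names are only for illustration.

```r
num_from_text <- as.numeric("3.3")    # 3.3
text_from_num <- as.character(5)      # "5"
int_from_lgl  <- as.integer(TRUE)     # 1

# Logicals are treated as 0 and 1 in arithmetic, which makes counting easy
n_true <- sum(c(TRUE, TRUE, FALSE))   # 2

# Text that cannot be read as a number becomes NA (R also issues a warning,
# which is silenced here with suppressWarnings())
not_a_number <- suppressWarnings(as.numeric("p"))   # NA
```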

7.2 Matrices

In R, matrices are an extension of numeric or character vectors. They are simply vectors with dimensions (the number of rows and columns). As with vectors, the elements of a matrix must be of the same data type.

m <- matrix(nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA

We can investigate our matrix using various attribute functions (like dim(), class(), and typeof()).

dim(m)
## [1] 2 2
class(m)
## [1] "matrix" "array"
typeof(m)
## [1] "logical"

We can fill this matrix in with values, which is done in R column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)
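To make the column-wise fill visible, the sketch below prints the matrix and compares it with the byrow = TRUE argument of matrix(), which fills row by row instead. The names m_by_col and m_by_row are only for illustration.

```r
# Default: values run down the first column, then the second, and so on
m_by_col <- matrix(1:6, nrow = 2, ncol = 3)
m_by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE fills across the first row, then the second
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```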

Another way to fill matrices is to bind rows or columns using rbind() and cbind() (“row bind” and “column bind”, respectively).

x <- 4:6
y <- 10:12
cbind(x, y)
##      x  y
## [1,] 4 10
## [2,] 5 11
## [3,] 6 12
mdat <- rbind(x, y)

Elements of a matrix can be obtained by specifying the indices along each dimension (e.g. “row” and “column”) in single square brackets.

mdat[1,3]
## x 
## 6
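Leaving one index empty selects a whole row or a whole column. The sketch below rebuilds mdat so that it runs on its own; first_row, second_col, and last_of_y are names chosen only for this illustration.

```r
x <- 4:6
y <- 10:12
mdat <- rbind(x, y)

first_row  <- mdat[1, ]     # all columns of row 1: 4 5 6
second_col <- mdat[, 2]     # all rows of column 2: 5 (from x) and 11 (from y)
last_of_y  <- mdat["y", 3]  # rows created by rbind() can be selected by name: 12
dim(mdat)                   # 2 rows, 3 columns
```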

7.3 Lists

Lists act as containers in R. Unlike vectors, the elements of a list can be of more than one mode, so a list can contain any mixture of data types. Lists are sometimes referred to as “generic vectors”, because the elements of a list can be any type of R object. Lists can even contain other lists as elements (“nested lists”). These properties make lists fundamentally different from atomic vectors.

xList <- list(1, "a", TRUE)
xList
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE

Elements within lists can be obtained using double square brackets.

xList[[2]]
## [1] "a"

List elements can be named.

nameXList <- list(myNum = 1, myChar = "t", myBool = TRUE)
nameXList
## $myNum
## [1] 1
## 
## $myChar
## [1] "t"
## 
## $myBool
## [1] TRUE
names(nameXList)
## [1] "myNum"  "myChar" "myBool"

We can obtain the element values within the named list by calling the name with the $ notation.

nameXList$myChar
## [1] "t"

Lists can be very helpful inside functions. This is because functions in R can only return a single object. Therefore, you can “concatenate” (staple) together numerous result values into a single list that the function can then return.
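As a minimal sketch of this idea, the hypothetical function below (summarizeValues is a name made up for this example) bundles three results into one named list and returns it:

```r
summarizeValues <- function(values) {
  # Collect several results of different lengths into one named list
  result <- list(
    n     = length(values),
    mean  = mean(values),
    range = range(values)
  )
  return(result)
}

stats <- summarizeValues(c(2, 4, 6, 8))
stats$n       # 4
stats$mean    # 5
stats$range   # 2 8
```

The caller then pulls out individual results with the $ notation shown earlier.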


7.4 Data frames

A data frame is one of the most important data structures in R and is often used for tabular information in statistics. A data frame is a list in which every element has the same length (i.e. a data frame is a “rectangular” list).

  • A data frame can be created by importing data into R (read.csv() and read.table())
  • If all columns in a data frame are the same type, the data frame can be converted to a matrix using as.matrix()
  • A data frame can be created in R with the data.frame() function.
  • Row names are often automatically generated as 1, 2, …, n. Row numbers can become inconsistent when a data frame is reshuffled or subset.

We can create a data frame below.

df <- data.frame(id = letters[1:10], var1 = 1:10, var2 = seq(11,30,2))
df
##    id var1 var2
## 1   a    1   11
## 2   b    2   13
## 3   c    3   15
## 4   d    4   17
## 5   e    5   19
## 6   f    6   21
## 7   g    7   23
## 8   h    8   25
## 9   i    9   27
## 10  j   10   29

There are many useful data frame functions:

  • head() - shows top 6 rows
  • tail() - shows bottom 6 rows
  • dim() - returns dimensions of data frame (number of rows and columns)
  • nrow() - number of rows
  • ncol() - number of columns
  • str() - structure of data frame - name, type and preview of data in each column
  • names() or colnames() - both show the names attribute for a data frame
  • sapply(dataframe, class) - shows the class of each column in the data frame

head(df)
##   id var1 var2
## 1  a    1   11
## 2  b    2   13
## 3  c    3   15
## 4  d    4   17
## 5  e    5   19
## 6  f    6   21
tail(df)
##    id var1 var2
## 5   e    5   19
## 6   f    6   21
## 7   g    7   23
## 8   h    8   25
## 9   i    9   27
## 10  j   10   29
dim(df)
## [1] 10  3
nrow(df)
## [1] 10
ncol(df)
## [1] 3
str(df)
## 'data.frame':    10 obs. of  3 variables:
##  $ id  : chr  "a" "b" "c" "d" ...
##  $ var1: int  1 2 3 4 5 6 7 8 9 10
##  $ var2: num  11 13 15 17 19 21 23 25 27 29
names(df)
## [1] "id"   "var1" "var2"
sapply(df, class)
##          id        var1        var2 
## "character"   "integer"   "numeric"

Since data frames are rectangular, elements of data frames can be accessed by specifying the row and the column index in single square brackets.

df[2, 3]
## [1] 13

Since data frames are special forms of lists, we can obtain columns using the list notation, i.e. either double square brackets or a $.

df[["var2"]]
##  [1] 11 13 15 17 19 21 23 25 27 29
df$var2
##  [1] 11 13 15 17 19 21 23 25 27 29
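The same bracket and $ notation can be combined to filter rows and to add columns. The sketch below rebuilds df so that it runs on its own; var3, big_rows, and id_and_sum are names made up for this illustration.

```r
df <- data.frame(id = letters[1:10], var1 = 1:10, var2 = seq(11, 30, 2))

# Keep only the rows where var1 is greater than 7 (the empty column index
# after the comma means "all columns")
big_rows <- df[df$var1 > 7, ]
big_rows$id            # "h" "i" "j"

# Add a new column computed from existing ones
df$var3 <- df$var1 + df$var2
head(df$var3, 3)       # 12 15 18

# Select several columns by name
id_and_sum <- df[, c("id", "var3")]
```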

If you are used to working with Excel-like format, you can also “View” the data frame in that format as follows:

View(df)

Since statisticians often work with data frames, packages have been written that make working with and manipulating data frames smoother than with base R functions alone. One popular package for working with data frames is dplyr.


8 Reading and writing data into R

Scientists often store data in Excel spreadsheets. There are various R packages that can help R users access data from Excel spreadsheets (XLConnect, gdata, RODBC, RExcel, and xlsx). However, many users find it simpler to save their spreadsheets in comma-separated values files (.CSV) and then use base R functionality to read and manipulate the data.

We can read in a CSV file (Labmates.csv) that contains each of our names and spirit animals. First, make sure you are located in the directory where the file is located. You can do this using the “set working directory” command (setwd()) or by manually choosing “Set As Working Directory” in your RStudio space. Then, you should be able to run the command:

my_data <- read.csv("data/Labmates.csv")

It may also be easier to run the following command, which opens a GUI that lets you browse to the file location and select the file.

my_data <- read.csv(file.choose())

Looks like we each have a spirit animal. Vinh san is not happy. He really loves cats. Plus, he really does not want to have a pangolin as a spirit animal these days. Let’s switch the spirit animal of Vinh and Lindsay (who originally had cat). Then, let’s also add a column to indicate who attended the R crash course today.

my_data[2,2] = "Cat"
my_data[7,2] = "Pangolin"
my_data$Attendance = c(1,0,1,0,1,0,1)

Now, we can save/write our updated file to a desired location.

write.csv(my_data, file = 'data/LabmatesAttendance.csv')

If you do not like the first column with the indices, you can save as follows:

write.csv(my_data, file = 'data/LabmatesAttendance.csv', row.names = FALSE)

Note there is also a very flexible storage format in R in which you can save any type of object (not just CSV or data frame). This can be done with the RDS file format using the saveRDS() and the readRDS() functions.

saveRDS(my_data, file = "data/LabmatesAttendance.Rds")

We can now read in the object and, if desired, set it to a new variable attendanceRDS.

attendanceRDS <- readRDS("data/LabmatesAttendance.Rds")
str(attendanceRDS)
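If you do not have the Labmates.csv file at hand, the write-then-read round trip can be tried with a small made-up data frame and temporary files. Everything in this sketch (the labmates data frame, its contents, and the temporary paths) is hypothetical and only meant to illustrate the functions above.

```r
# A small made-up data frame standing in for the real file
labmates <- data.frame(
  Name   = c("Ana", "Ben", "Cy"),
  Animal = c("Cat", "Owl", "Fox"),
  stringsAsFactors = FALSE
)

# Round trip through a CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(labmates, file = csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
all.equal(from_csv, labmates)    # TRUE

# Round trip through an RDS file, which can hold any R object
rds_path <- tempfile(fileext = ".Rds")
saveRDS(labmates, file = rds_path)
from_rds <- readRDS(rds_path)
identical(from_rds, labmates)    # TRUE
```

Unlike CSV, the RDS format stores the object exactly as it was in R, including column types.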

9 Example data in R

There are numerous built-in datasets in R that you can use to practice and/or to create “minimal working examples” when you cannot use your real data. You can see a list of example R datasets by typing:

data()

One example dataset is mtcars. We can load it and examine it.

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
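Once a dataset is loaded, a few base R functions give a quick numerical overview. The sketch below is one possible first look at mtcars; the variable names on the left are only for illustration.

```r
data(mtcars)

# Average fuel economy across all 32 cars (about 20.09 miles per gallon)
overall_mpg <- mean(mtcars$mpg)

# How many cars have 4, 6, and 8 cylinders (11, 7, and 14)
cyl_counts <- table(mtcars$cyl)

# Average mpg within each cylinder group, using a formula interface
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```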

10 Graphics in R

There are numerous basic plotting types in R that do not require any extra packages. For each plot, there is often a more powerful equivalent in ggplot2, which we first need to install. The ggplot2 package is on CRAN. For packages hosted on CRAN, we can install them using the install.packages() function. We can then load the package into our current R session using the library() function.

install.packages("ggplot2")
library(ggplot2)

10.1 Scatterplots

In base R, we can create a scatterplot using:

plot(x = mtcars$wt, y = mtcars$mpg)

The same scatterplot can also be created in ggplot2 using qplot():

library(ggplot2)
qplot(x = mtcars$wt, y = mtcars$mpg)


10.2 Line plots

We can create a line plot based on the pressure dataset in R.

plot(x = pressure$temperature, y = pressure$pressure, type = "l")

We can instead pass the argument type = “s” to produce a stepped line chart:

plot(x = pressure$temperature, y = pressure$pressure, type = "s")

We can use qplot() to get similar results by using the geom argument. In graphics, geoms are geometric objects (lines, points, etc.) that visually represent the data. In this case, we can represent the data using a line, a stepped line, or a line together with points:

qplot(temperature, pressure, data = pressure, geom = "line")

qplot(temperature, pressure, data = pressure, geom = "step")

qplot(temperature, pressure, data = pressure, geom = c("line", "point"))
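The qplot() function is a shortcut, and recent versions of ggplot2 flag it as deprecated, so it may be worth knowing the fuller ggplot() form as well. The sketch below assumes ggplot2 is already installed and draws the same line-plus-points chart; p is a name chosen only for this example.

```r
library(ggplot2)

# Map columns to the x and y axes, then add one layer per geom
p <- ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point()

print(p)
```

Swapping geom_line() for geom_step() gives the stepped version.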

There are countless other plots that can be made in base R and ggplot2, including box plots, bar plots, histograms, stem-and-leaf plots, and mosaic plots. There are great online resources for practicing plots, as well as entire textbooks. As shown earlier in this tutorial, there are also cheat sheets available in RStudio. With ggplot2 graphics, you can also use additional packages to render them interactive fairly easily. You can see neat examples of this from the function ggplotly here.

11 Downloading packages from multiple sources

There are various ways to download R packages from online resources. Three common sources are CRAN, GitHub, and Bioconductor. As an example, the ggplot2 package is available on CRAN here. Hence, we could install the CRAN version using the code we already saw:

install.packages("ggplot2")

You can also install the development version from GitHub (if there is one). It seems ggplot2 has a GitHub repository here. The functionality to install a package from a GitHub repository is provided by the devtools package, which is on CRAN. Here, we first install the devtools package from CRAN, load it into our R session, and then use one of its functions (install_github()) to install the ggplot2 version on GitHub.

# install devtools package from CRAN
install.packages("devtools")
library(devtools)

devtools::install_github("tidyverse/ggplot2")

Some packages, especially ones that relate to bioinformatics, are on Bioconductor. For example, the RNA-seq analysis packages DESeq2 and edgeR are on Bioconductor. If you try to install these packages using the CRAN functionality, you will get an error (package ‘edgeR’ is not available (for R version 4.0.2)).

install.packages("edgeR")

Instead, to install a Bioconductor package into R, you will need the following type of code:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("edgeR")
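The requireNamespace() check used above also works for CRAN packages, so that a script only installs what is missing. The helper below is a hypothetical convenience function (install_if_missing is a name made up for this sketch), not part of any package.

```r
# Hypothetical helper: install a CRAN package only if it is not yet available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded in this example
install_if_missing("stats")
```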

12 Creating a minimal working example

Sometimes you cannot troubleshoot an error in R, even after thinking carefully and reading about potential underlying causes. One thing you may want to do in that case is ask a colleague or post on StackOverflow. Sometimes you cannot show your real data (due to privacy issues). Moreover, on StackOverflow, you cannot upload any data. Hence, you often need to create a “minimal working example”: that is, a simulated dataset that has the same data types and formats as your real data that can be used to simulate the error. Let’s look at an example of how to do this.

Say, you are working with a sensitive dataset. Let’s read it in first.

senDF <- readRDS("data/sensitiveData.Rds")

We can see this dataset contains patient names, sound gene count information, and a phenotype that could be sensitive (cancer status, mental illness, etc).

str(senDF)
## 'data.frame':    100 obs. of  3 variables:
##  $ patient  : chr  "Patient1" "Patient2" "Patient3" "Patient4" ...
##  $ geneCount: Factor w/ 62 levels "5","6","7","8",..: 5 11 54 12 34 48 23 59 30 55 ...
##  $ phenotype: chr  "Yes" "No" "No" "Yes" ...

Say you wanted to sum up all the gene counts in this dataset. Usually, this can be achieved easily by applying the sum() function to the corresponding column. However, when we try to do this, we receive an error (“‘sum’ not meaningful for factors”):

sum(senDF$geneCount)

If you are unable to figure out this error and want to consult colleagues or StackOverflow, you will need to provide code to them that creates a minimal working example data frame (mweDF) that has the same data types as your sensitive data frame (senDF).

patient = paste0("Patient", 1:100)
geneCount = as.factor(sample(1:100, 100, replace=TRUE))
phenotype = sample(c("Yes","No"), 100, replace = TRUE)
mweDF = data.frame(patient = patient, geneCount = geneCount, phenotype = phenotype)
str(mweDF)
## 'data.frame':    100 obs. of  3 variables:
##  $ patient  : chr  "Patient1" "Patient2" "Patient3" "Patient4" ...
##  $ geneCount: Factor w/ 65 levels "1","2","3","5",..: 65 2 63 23 50 37 30 1 56 48 ...
##  $ phenotype: chr  "No" "No" "No" "Yes" ...
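Because sample() draws random values, everyone who runs the code above gets a slightly different data frame. One common way to make a simulated example identical for everyone is to set a random seed first, as in the sketch below (the seed value 123 is arbitrary). The dput() call prints R code that recreates an object, which is handy for pasting a few rows into a forum post.

```r
set.seed(123)  # arbitrary seed, so the random draws are the same on every run

patient   <- paste0("Patient", 1:100)
geneCount <- as.factor(sample(1:100, 100, replace = TRUE))
phenotype <- sample(c("Yes", "No"), 100, replace = TRUE)
mweDF     <- data.frame(patient = patient, geneCount = geneCount, phenotype = phenotype)

# Print code that rebuilds the first three rows (unused factor levels dropped)
dput(droplevels(head(mweDF, 3)))
```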

Then, your colleagues can simulate your error by running the sum() function on your simulated data frame.

sum(mweDF$geneCount)

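If they have more experience, they can then provide a suggestion: in this case, to convert your geneCount column to numeric type. Note that calling as.numeric() directly on a factor returns the internal level codes rather than the original values, so we convert to character first.

mweDF$geneCount = as.numeric(as.character(mweDF$geneCount))

Indeed, we no longer get that error and can successfully sum that column. The exact total depends on the random sample drawn above.

sum(mweDF$geneCount)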
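When converting a factor to numbers, keep in mind that a factor stores integer level codes behind its labels, and as.numeric() applied directly to a factor returns those codes. The tiny made-up factor below (counts) shows the difference.

```r
counts <- factor(c("10", "20", "20", "5"))   # made-up values stored as a factor
levels(counts)                  # "10" "20" "5", sorted as text

# Direct conversion returns the level codes 1 2 2 3, not the values
code_sum <- sum(as.numeric(counts))                   # 8

# Converting to character first recovers the original values
value_sum <- sum(as.numeric(as.character(counts)))    # 55

# An equivalent alternative: convert the levels once, then index by the factor
value_sum_alt <- sum(as.numeric(levels(counts))[counts])   # 55
```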

13 Challenge with corrplot

Before her rotation ends, Sunanda wants to replicate a figure she found in a journal article that was created using the corrplot package. Based on this brief R crash course, are you able to replicate the first few figures from the corrplot vignette?